Merge from ci_feature_prod into ci_feature #316
Merged
…d) (#311) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310)
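One entry in the log above, "Zero Fill for Missing Pod Phases" (#194), describes reporting a count for every pod phase even when no pod is currently in that phase. The following is a minimal sketch of that idea, not the code from this pull request: the helper name `zero_fill_phase_counts` and the input shape (an array of hashes with a `"phase"` key) are assumptions made for illustration. The five phase names are the standard Kubernetes pod phases.

```ruby
# Hypothetical helper illustrating zero-filling; not taken from the agent source.
# The standard Kubernetes pod phases, in a fixed reporting order.
POD_PHASES = %w[Pending Running Succeeded Failed Unknown].freeze

# pods: array of hashes such as { "phase" => "Running" } (assumed shape).
# Returns a hash with one entry per phase, including phases with zero pods.
def zero_fill_phase_counts(pods)
  # Start every phase at zero so absent phases still produce a data point.
  counts = POD_PHASES.map { |phase| [phase, 0] }.to_h
  pods.each do |pod|
    phase = pod.fetch("phase", "Unknown")
    # Treat unrecognized phase strings as Unknown rather than adding new keys.
    phase = "Unknown" unless counts.key?(phase)
    counts[phase] += 1
  end
  counts
end

sample = [{ "phase" => "Running" }, { "phase" => "Running" }, { "phase" => "Failed" }]
zero_fill_phase_counts(sample).each { |phase, count| puts "#{phase}: #{count}" }
```

Emitting explicit zeros means a chart of pod counts by phase shows a continuous series instead of gaps whenever a phase is temporarily empty.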
rashmichandrashekar approved these changes Dec 4, 2019
vishiy added a commit that referenced this pull request Dec 4, 2019
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310)
rashmichandrashekar added a commit that referenced this pull request on Jan 7, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errors * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updating release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes Co-authored-by: Vishwanath <visnara@microsoft.com> Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com> Co-authored-by: bragi92 <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: ganga1980 <gangams@microsoft.com>
rashmichandrashekar added a commit that referenced this pull request on Jan 7, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing.
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact Co-authored-by: Vishwanath <visnara@microsoft.com> Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com> Co-authored-by: bragi92 <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: ganga1980 <gangams@microsoft.com>
rashmichandrashekar added a commit that referenced this pull request on Feb 26, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact * For ARO, stop collecting inventory of master and infra (#323) * filter out infra and master nodes inventory for aro * filterout pods info scheduled master and infra nodes * fix redundant KubernetesApiClient name * filter out events sourced from master and infra nodes * fix in kubeapi * add the comments * fix pr feedback * minor updates * fix pr feedback * encode special characters in 
query * some refactoring * MDM plugin support for large scale clusters (#324) * Batch Commit * WIP: Committing move logic from filter to input * WIP : MDM plugins for scale clusters * Bug fixes 1. cpu percentage 2. bytesize on array. Remove log line * Fixing metric value in cadvisor2mdm plugin * WIP to laptop * Working version with cadvisor changes * Fix Health cpu usage * Added uri for cadvisor failure * Add Null check for kube api responses in in_kube_health (#325) * Fix casing bug (#326) * Missed kube.conf update (#327) * changes to use msi if service principal does not exist (#328) changes to use msi if service principal does not exist (#328) * Adding caseinsensitive compare (#330) Adding case insensitive compare * gpu monitoring (#329) * gpu monitoring * Emit info log for tests for the new insightsmetrics data stream
Co-authored-by: Vishwanath <visnara@microsoft.com>
Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com>
Co-authored-by: bragi92 <kadubey@microsoft.com>
Co-authored-by: David Michelman <daweim0@gmail.com>
Co-authored-by: ganga1980 <gangams@microsoft.com>
ganga1980 added a commit that referenced this pull request Apr 17, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * workaround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we don't need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes: 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updating release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image, name. Adding more logs (#129) * Using KubeAPI for getting image, name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for JSON marshal error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog, ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins (Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * workaround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we don't need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding environment variable collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact * For ARO, stop collecting inventory of master and infra (#323) * filter out infra and master nodes inventory for aro * filterout pods info scheduled master and infra nodes * fix redundant KubernetesApiClient name * filter out events sourced from master and infra nodes * fix in kubeapi * add the comments * fix pr feedback * minor updates * fix pr feedback * encode special characters in 
query * some refactoring * MDM plugin support for large scale clusters (#324) * Batch Commit * WIP: Committing move logic from filter to input * WIP : MDM plugins for scale clusters * Bug fixes 1. cpu percentage 2. bytesize on array. Remove log line * Fixing metric value in cadvisor2mdm plugin * WIP to laptop * Working version with cadvisor changes * Fix Health cpu usage * Added uri for cadvisor failure * Add Null check for kube api responses in in_kube_health (#325) * Fix casing bug (#326) * Missed kube.conf update (#327) * changes to use msi if service principal does not exist (#328) changes to use msi if service principal does not exist (#328) * Adding caseinsensitive compare (#330) Adding case insensitive compare * gpu monitoring (#329) * gpu monitoring * Emit info log for tests for the new insightsmetrics data stream * Update release notes * MDM batch Bug (#336) * kube evnts bug fix (#335) * Update readme.md * Rate Limiting changes (#338) * Add header for throttling Add telemetry for throttling Flush 15secs for telegraf metrics * * Add requestid & log it for errors * Fix bug in semver * Gangams/add support cri runtime docker env (#337) * trim log tag for cri compatible log lines * fix variable declaration * debug log messages * trim spaces * remove debug logs * fix build error * add pods api in cadvisor * wip * wip * wip * wip * fix bug * fix bug * fix bug with end * fix bug in the reference * fix bug * fix telemetry * fix image and imageid bug * fix containerid bug * fix envvars * fix syntax error * fix syntax error * fix env and command * handle labels and annotation fieldref as envs * wip * implement obtain envvars * refactor the code * fix log warn * add more logging * fix npe * revert kubelet api to get the envvars * fix minor issue * fix log message * comment valueFrom for now since this causing some issue * fix formatting issue * fix bug * add resourcefieldref support in env * include init containers as well * fix bug * fix typo * use chomp instead of 
delete_suffix * add secretkeyref support * add secretkeyref support * fix bug * handle dockerversion for non-docker runtimes * fix comment * handle kubelet metrics to handle runtime and k8s versions * remove _total metrics until we support 1.18 * fetch envvars via proc environ * fix undefined vars * fix undefined vars * add init containers * split on null character * consider crio containers main proceses for envvars * unschedule pods or containers with pull issues * fix issue * fix feedback * update to azmon parsers * fix file extension * add some debug messages * fix timestamp issue * fix pr feedback * refactor linux containerenvvars code * refactor windows container inventory code * fix rb file load error from in_kube_node inventory * add new rb file base_container.data * turnoff docker container inventory * cleanup debug logs * fix npe in log message * avoid environ for terminated containers * fix pr feedback * remove empty file * clean up debug logs * Gangams/fix cri exceptions (#339) * fix nil exception * add exception logging message * fix npe * fix log exception (#340) * Updating release notes, mdm bug fix (#333) (#343) Updating release notes, mdm bug fix for release Co-authored-by: rashmichandrashekar <rashmy@microsoft.com> Co-authored-by: Vishwanath Narasimhan <visnara@microsoft.com> Co-authored-by: rashmy <rashmy@RASHMY-PC2> Co-authored-by: r-dilip <dilip.rangarajan@gmail.com> Co-authored-by: rashmichandrashekar <rashmy@microsoft.com> Co-authored-by: Kaveesh Dubey <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: Dilip Raghunathan <dilipr@microsoft.com>
ganga1980 added a commit that referenced this pull request Apr 22, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before a fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes: 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact * For ARO, stop collecting inventory of master and infra (#323) * filter out infra and master nodes inventory for aro * filterout pods info scheduled master and infra nodes * fix redundant KubernetesApiClient name * filter out events sourced from master and infra nodes * fix in kubeapi * add the comments * fix pr feedback * minor updates * fix pr feedback * encode special characters in 
query * some refactoring * MDM plugin support for large scale clusters (#324) * Batch Commit * WIP: Committing move logic from filter to input * WIP : MDM plugins for scale clusters * Bug fixes 1. cpu percentage 2. bytesize on array. Remove log line * Fixing metric value in cadvisor2mdm plugin * WIP to laptop * Working version with cadvisor changes * Fix Health cpu usage * Added uri for cadvisor failure * Add Null check for kube api responses in in_kube_health (#325) * Fix casing bug (#326) * Missed kube.conf update (#327) * changes to use msi if service principal does not exist (#328) changes to use msi if service principal does not exist (#328) * Adding caseinsensitive compare (#330) Adding case insensitive compare * gpu monitoring (#329) * gpu monitoring * Emit info log for tests for the new insightsmetrics data stream * Update release notes * MDM batch Bug (#336) * kube evnts bug fix (#335) * Update readme.md * Rate Limiting changes (#338) * Add header for throttling Add telemetry for throttling Flush 15secs for telegraf metrics * * Add requestid & log it for errors * Fix bug in semver * Gangams/add support cri runtime docker env (#337) * trim log tag for cri compatible log lines * fix variable declaration * debug log messages * trim spaces * remove debug logs * fix build error * add pods api in cadvisor * wip * wip * wip * wip * fix bug * fix bug * fix bug with end * fix bug in the reference * fix bug * fix telemetry * fix image and imageid bug * fix containerid bug * fix envvars * fix syntax error * fix syntax error * fix env and command * handle labels and annotation fieldref as envs * wip * implement obtain envvars * refactor the code * fix log warn * add more logging * fix npe * revert kubelet api to get the envvars * fix minor issue * fix log message * comment valueFrom for now since this causing some issue * fix formatting issue * fix bug * add resourcefieldref support in env * include init containers as well * fix bug * fix typo * use chomp instead of 
delete_suffix * add secretkeyref support * add secretkeyref support * fix bug * handle dockerversion for non-docker runtimes * fix comment * handle kubelet metrics to handle runtime and k8s versions * remove _total metrics until we support 1.18 * fetch envvars via proc environ * fix undefined vars * fix undefined vars * add init containers * split on null character * consider crio containers main proceses for envvars * unschedule pods or containers with pull issues * fix issue * fix feedback * update to azmon parsers * fix file extension * add some debug messages * fix timestamp issue * fix pr feedback * refactor linux containerenvvars code * refactor windows container inventory code * fix rb file load error from in_kube_node inventory * add new rb file base_container.data * turnoff docker container inventory * cleanup debug logs * fix npe in log message * avoid environ for terminated containers * fix pr feedback * remove empty file * clean up debug logs * Gangams/fix cri exceptions (#339) * fix nil exception * add exception logging message * fix npe * fix log exception (#340) * Updating release notes, mdm bug fix (#333) (#343) Updating release notes, mdm bug fix for release Co-authored-by: rashmichandrashekar <rashmy@microsoft.com> * update release notes for 04162020 release (#346) * update release notes * fix pr feedback Co-authored-by: Vishwanath Narasimhan <visnara@microsoft.com> Co-authored-by: rashmy <rashmy@RASHMY-PC2> Co-authored-by: r-dilip <dilip.rangarajan@gmail.com> Co-authored-by: rashmichandrashekar <rashmy@microsoft.com> Co-authored-by: Kaveesh Dubey <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: Dilip Raghunathan <dilipr@microsoft.com>
ayusheesingh-zz pushed a commit that referenced this pull request Jun 27, 2020
…d) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310)
ayusheesingh-zz pushed a commit that referenced this pull request on Jun 27, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310)
ayusheesingh-zz pushed a commit that referenced this pull request Jun 27, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs; fix stateful set name * workaround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we don't need them anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes: 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing.
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs; fix stateful set name * workaround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we don't need them anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding environment variable collection/disable * disable env var for
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes Co-authored-by: Vishwanath <visnara@microsoft.com> Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com> Co-authored-by: bragi92 <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: ganga1980 <gangams@microsoft.com>
ayusheesingh-zz pushed a commit that referenced this pull request on Jun 27, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact Co-authored-by: Vishwanath <visnara@microsoft.com> Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com> Co-authored-by: bragi92 <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: ganga1980 <gangams@microsoft.com>
ayusheesingh-zz pushed a commit that referenced this pull request Jun 27, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact * For ARO, stop collecting inventory of master and infra (#323) * filter out infra and master nodes inventory for aro * filterout pods info scheduled master and infra nodes * fix redundant KubernetesApiClient name * filter out events sourced from master and infra nodes * fix in kubeapi * add the comments * fix pr feedback * minor updates * fix pr feedback * encode special characters in 
query * some refactoring * MDM plugin support for large scale clusters (#324) * Batch Commit * WIP: Committing move logic from filter to input * WIP : MDM plugins for scale clusters * Bug fixes 1. cpu percentage 2. bytesize on array. Remove log line * Fixing metric value in cadvisor2mdm plugin * WIP to laptop * Working version with cadvisor changes * Fix Health cpu usage * Added uri for cadvisor failure * Add Null check for kube api responses in in_kube_health (#325) * Fix casing bug (#326) * Missed kube.conf update (#327) * changes to use msi if service principal does not exist (#328) changes to use msi if service principal does not exist (#328) * Adding caseinsensitive compare (#330) Adding case insensitive compare * gpu monitoring (#329) * gpu monitoring * Emit info log for tests for the new insightsmetrics data stream Co-authored-by: Vishwanath <visnara@microsoft.com> Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com> Co-authored-by: bragi92 <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: ganga1980 <gangams@microsoft.com>
ayusheesingh-zz pushed a commit that referenced this pull request on Jun 27, 2020
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding environment variable collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering; add more telemetry for config processing; move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making an api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana team's findings * making db sync off * buffer chunk and max as 1m so that we don't flush > 1m payloads * increasing rotate wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinitely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making an api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana team's findings * making db sync off * buffer chunk and max as 1m so that we don't flush > 1m payloads * increasing rotate wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinitely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for June release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errors * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes: 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact * For ARO, stop collecting inventory of master and infra (#323) * filter out infra and master nodes inventory for aro * filterout pods info scheduled master and infra nodes * fix redundant KubernetesApiClient name * filter out events sourced from master and infra nodes * fix in kubeapi * add the comments * fix pr feedback * minor updates * fix pr feedback * encode special characters in 
query * some refactoring * MDM plugin support for large scale clusters (#324) * Batch Commit * WIP: Committing move logic from filter to input * WIP : MDM plugins for scale clusters * Bug fixes 1. cpu percentage 2. bytesize on array. Remove log line * Fixing metric value in cadvisor2mdm plugin * WIP to laptop * Working version with cadvisor changes * Fix Health cpu usage * Added uri for cadvisor failure * Add Null check for kube api responses in in_kube_health (#325) * Fix casing bug (#326) * Missed kube.conf update (#327) * changes to use msi if service principal does not exist (#328) changes to use msi if service principal does not exist (#328) * Adding caseinsensitive compare (#330) Adding case insensitive compare * gpu monitoring (#329) * gpu monitoring * Emit info log for tests for the new insightsmetrics data stream * Update release notes * MDM batch Bug (#336) * kube evnts bug fix (#335) * Update readme.md * Rate Limiting changes (#338) * Add header for throttling Add telemetry for throttling Flush 15secs for telegraf metrics * * Add requestid & log it for errors * Fix bug in semver * Gangams/add support cri runtime docker env (#337) * trim log tag for cri compatible log lines * fix variable declaration * debug log messages * trim spaces * remove debug logs * fix build error * add pods api in cadvisor * wip * wip * wip * wip * fix bug * fix bug * fix bug with end * fix bug in the reference * fix bug * fix telemetry * fix image and imageid bug * fix containerid bug * fix envvars * fix syntax error * fix syntax error * fix env and command * handle labels and annotation fieldref as envs * wip * implement obtain envvars * refactor the code * fix log warn * add more logging * fix npe * revert kubelet api to get the envvars * fix minor issue * fix log message * comment valueFrom for now since this causing some issue * fix formatting issue * fix bug * add resourcefieldref support in env * include init containers as well * fix bug * fix typo * use chomp instead of 
delete_suffix * add secretkeyref support * add secretkeyref support * fix bug * handle dockerversion for non-docker runtimes * fix comment * handle kubelet metrics to handle runtime and k8s versions * remove _total metrics until we support 1.18 * fetch envvars via proc environ * fix undefined vars * fix undefined vars * add init containers * split on null character * consider crio containers main proceses for envvars * unschedule pods or containers with pull issues * fix issue * fix feedback * update to azmon parsers * fix file extension * add some debug messages * fix timestamp issue * fix pr feedback * refactor linux containerenvvars code * refactor windows container inventory code * fix rb file load error from in_kube_node inventory * add new rb file base_container.data * turnoff docker container inventory * cleanup debug logs * fix npe in log message * avoid environ for terminated containers * fix pr feedback * remove empty file * clean up debug logs * Gangams/fix cri exceptions (#339) * fix nil exception * add exception logging message * fix npe * fix log exception (#340) * Updating release notes, mdm bug fix (#333) (#343) Updating release notes, mdm bug fix for release Co-authored-by: rashmichandrashekar <rashmy@microsoft.com> Co-authored-by: Vishwanath Narasimhan <visnara@microsoft.com> Co-authored-by: rashmy <rashmy@RASHMY-PC2> Co-authored-by: r-dilip <dilip.rangarajan@gmail.com> Co-authored-by: rashmichandrashekar <rashmy@microsoft.com> Co-authored-by: Kaveesh Dubey <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: Dilip Raghunathan <dilipr@microsoft.com>
jatakiajanvi12 pushed a commit that referenced this pull request Dec 2, 2022
…d) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310)
jatakiajanvi12 pushed a commit that referenced this pull request on Dec 2, 2022
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310)
jatakiajanvi12 pushed a commit that referenced this pull request Dec 2, 2022
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes Co-authored-by: Vishwanath <visnara@microsoft.com> Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com> Co-authored-by: bragi92 <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: ganga1980 <gangams@microsoft.com>
jatakiajanvi12 pushed a commit that referenced this pull request Dec 2, 2022
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes: 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact Co-authored-by: Vishwanath <visnara@microsoft.com> Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com> Co-authored-by: bragi92 <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: ganga1980 <gangams@microsoft.com>
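For readers unfamiliar with the "CAdvisor to use 10255/10250 based on env variable (#321)" item above, the following is a minimal sketch of the idea, not the agent's actual code. It assumes a hypothetical environment variable named `USE_SECURE_CADVISOR_PORT` and a hypothetical helper `cadvisor_base_uri`; the default chosen when the variable is unset is also an assumption, since the commit message only says "switching defaults". The port numbers are the standard kubelet ports: 10250 is the authenticated HTTPS port and 10255 is the read-only HTTP port.

```python
import os

# Standard kubelet ports: 10250 serves HTTPS with authentication,
# 10255 is the unauthenticated read-only HTTP port.
SECURE_PORT = 10250
INSECURE_PORT = 10255


def cadvisor_base_uri(node_ip, env=None):
    """Return the base URI to use for cAdvisor calls on a node.

    USE_SECURE_CADVISOR_PORT is a hypothetical variable name used only
    for illustration; falling back to the insecure port when it is
    unset is likewise an assumption.
    """
    env = os.environ if env is None else env
    # Trim and lowercase so values such as " TRUE " are still accepted.
    secure = env.get("USE_SECURE_CADVISOR_PORT", "").strip().lower() == "true"
    if secure:
        return f"https://{node_ip}:{SECURE_PORT}"
    return f"http://{node_ip}:{INSECURE_PORT}"
```

Passing the environment as a parameter keeps the helper easy to exercise without mutating the process environment; calling `cadvisor_base_uri("10.0.0.4")` with no second argument reads from `os.environ`.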
jatakiajanvi12 pushed a commit that referenced this pull request on Dec 2, 2022
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use replicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes: 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20 secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix missing tag * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provider code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact * For ARO, stop collecting inventory of master and infra (#323) * filter out infra and master nodes inventory for aro * filterout pods info scheduled master and infra nodes * fix redundant KubernetesApiClient name * filter out events sourced from master and infra nodes * fix in kubeapi * add the comments * fix pr feedback * minor updates * fix pr feedback * encode special characters in 
query * some refactoring * MDM plugin support for large scale clusters (#324) * Batch Commit * WIP: Committing move logic from filter to input * WIP : MDM plugins for scale clusters * Bug fixes 1. cpu percentage 2. bytesize on array. Remove log line * Fixing metric value in cadvisor2mdm plugin * WIP to laptop * Working version with cadvisor changes * Fix Health cpu usage * Added uri for cadvisor failure * Add Null check for kube api responses in in_kube_health (#325) * Fix casing bug (#326) * Missed kube.conf update (#327) * changes to use msi if service principal does not exist (#328) changes to use msi if service principal does not exist (#328) * Adding caseinsensitive compare (#330) Adding case insensitive compare * gpu monitoring (#329) * gpu monitoring * Emit info log for tests for the new insightsmetrics data stream Co-authored-by: Vishwanath <visnara@microsoft.com> Co-authored-by: Dilip Raghunathan <dilip.rangarajan@gmail.com> Co-authored-by: bragi92 <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: ganga1980 <gangams@microsoft.com>
jatakiajanvi12 pushed a commit that referenced this pull request Dec 2, 2022
* Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. 
Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * 
Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes 
for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory (#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for 
fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent 
version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge 
from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. * fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use 
LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor 
comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * update readme for timeofcommand fix (#314) * Merge from ci_feature_prod into ci_feature (fix put back timeofcommand) (#311) (#316) * Updatng release history * fixing the plugin logs for emit stream * updating log message * Remove Log Processing from fluentd configuration * Remove plugin references from base_container.data * Dilipr/fluent bit log processing (#126) * Build out_oms.so and include in docker-cimprov package * Adding fluent-bit-config file to base container * PR Feedback * Adding out_oms.conf to base_container.data * PR Feedback * Making the critical section as small as possible * PR Feedback * Fixing the newline bug for Computer, and changing containerId to Id * Dilipr/glide updates (#127) * Updating glide.* files to include lumberjack * containerID="" for pull 
issues * Using KubeAPI for getting image,name. Adding more logs (#129) * Using KubeAPI for getting image,name. Adding more logs * Moving log file and state file to within the omsagent container * Changing log and state paths * Dilipr/mark comments (#130) * Marks Comments + Error Handling * Drop records from files that are not in k8s format * Remove unnecessary log line' * Adding Log to the file that doesn't conform to the expected format * Rashmi/segfault latest (#132) * adding null checks in all providers * fixing type * fixing type * adding more null checks * update cjson * Adding a missed null check (#135) * reusing some variables (#136) * Rashmi/cjson delete null check (#138) * adding null check for cjson-delete * null chk * removing null check * updating log level to debug for some provider workflows (#139) * Fixing CPU Utilization and removing Fluent-bit filters (#140) Removing fluent-bit filters, CPU optimizations * Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141) * Removing some logs, added more error checking, continue on kube-api error * Return FLB OK for json Marshall error, instead of RETRY * * Change FluentBit flush interval to 30 secs (from 5 secs) * Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset * Container Log Telemetry * Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file * PR feedback * PR feedback * Sending an event every 5 mins(Heartbeat) (#146) * PR feedback to cleanup removed workflows * updating agent version for telemetry * updating agent version * Telemetry Updates (#149) * Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. 
Added code to send Exceptions/errors * PR Feedback * Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159) * Changes to send omsagent/omsagent-rs kubectl logs to App Insights * PR Feedback * Rashmi/fluentd docker inventory (#160) * first stab * changes * changes * docker util changes * working tested util * input plugin and conf * changes * changes * changes * changes * changes * working containerinventory * fixing omi removal from container.conf * removing comments * file write and read * deleted containers working * changes * changes * socket timeout * deleting test files * adding log * fixing comment * appinsights changes * changes * tel changes * changes * changes * changes * changes * lib changes * changes * changes * fixes * PR comments * changes * updating the ownership * changes * changes * changes to container data * removing comment * changes * adding collection time * bug fix * env string truncation * changes for acs-engine test * Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (#162) * Fix kube events memory leak due to yaml serialization for > 5k events (#163) * Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(#164) * Vishwa/perftelemetry 2 (#165) * add cpu usage telemetry for ds & rs * add cpu & memory usage telemetry for ds & rs * environment variable fix (#166) * environment variable fix * updating agent version * Fixing a bug where we were crashing due to container statuses not present when not was lost (#167) * Updating title * updating right versions for last release * Updating the break condition to look for end of response (#168) * Updating the break condition to look for end of response * changes for docker response * updating AgentVersion for telemetry * Updating readme for latest release changes * Changes - (#173) * use /var/log for state * new metric ContainerLogsAgentSideLatencyMs * new field 'timeOfComand' * Rashmi/kubenodeinventory (#174) * 
containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * Get cpuusage from usageseconds (#175) * Rashmi/kubenodeinventory (#176) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * Rashmi/kubenodeinventory (#178) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179) * Rashmi/kubenodeinventory (#180) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Exclude docker containers from container inventory 
(#181) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * Exclude pauseamd64 containers from container inventory (#182) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * Update agent version * Updating readme for the latest release * Fix indentation in kube.conf and update readme (#184) * containernodeinventory changes * changes for containernodeinventory * changes to add node telemetry * pod telemetry cahnges * updated telemetry changes * changes to get uid of owner references as controller id * updating socket to the new mount location * Adding exception telemetry and heartbeat * changes to fix controller type * Fixing typo * fixing method signature * updating plugins to get 
controller type from env * fixing bugs * changes to fixed type * removing comments * changes for fixed type * adding kubelet version as a dimension * Excluding raw docker containers from container inventory * making labels key case insensitive * make poduid label case insensitive * changes to exclude pause amd 64 containers * fixing indentation so that kube.conf contents can be used in config map in the yaml * updating readme to fix date and agent version * updating agent tag * Get Pods for current Node Only (#185) * Fix KubeAPI Calls to filter to get pods for current node * Reinstate log line * changes for container node inventory fixed type (#186) * Fix for mooncake (disable telemetry optionally) (#191) * disable telemetry option * fix a typo * CustomMetrics to ci_feature (#193) Custom Metrics changes to ci_feature * add ContainerNotRunning column to KubePodInventory * merge pr feedback: update name to ContainerStatusReason * Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194) * Zero Fill for Pod Counts by Phase * Change namespace dimension to Kubernetes namespace * No Retries for non 404 4xx errors (#196) * Update agent version for telemetry * Update readme for upcoming (ciprod01202019) release * fix readme formatting * fix formatting for readme * fix formatting for readme * fix readme * fix readme * fix agent version for telemetry * fix date in readme * update readme * Restart logs every 10MB instead of weekly (#198) * Rotate logs every 10MB instead of weekly * Removing some logging, fixed log rotation * update agent version for telemetry * update readme * Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path * Fix AKSEngine Crash (#200) * hotfix * close resp.Body * remove chatty logs * membuf=5m and ignore files not updated since 5 mins * fix readme for new version * Fix the pod count in mdm agent plugin (#203) * Update readme * string freeze 
for out_mdm plugin * Vishwa/resourcecentric (#208) * resourceid fix (for AKS only) * fix name * Rashmi/win nodepool - PR (#206) * changes for win nodes enumeration * changes * changes * changes * node cpu metric rate changes * container cpu rate * changes * changes * changes * changes * changes * changes to include in_win_cadvisor_perf.rb file * send containerinventoryheartbeatevent * changes * cahnges for mdm metrics * changes * cahnges * changes * container states * changes * changes * changes for env variables * changes * changes * changes * changes * delete comments * changes * mutex changes * changes * changes * changes * telemetry fix for docker version * removing hardcoded values for mdm * update docker version * telemetry for windows cadvisor timeouts * exeception key update to computer * PR comments * adding os to container inventory for windows nodes (#210) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211) * Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors * Fixing the bug, deferring telemetry changes for later * updating to lowercase compare for units (#212) * Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214) * merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207) * add configuration for telegraf * fix for perms * fix telegraf config. 
* fix file location & config * update to config * fix namespace * trying different namespace and also debug=true * add placeholder for nodename * change namespace * updated config * fix uri * fix azMon settings * remove aad settings * add custom metrics regions * fix config * add support for replica-set config * fix oomkilled * Add telegraf 403 metric telemetry & non 403 trace telemetry * fix type * fix package * fix package import * fix filename * delete unused file * conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics * fix statefulsets * fix typo. * fix another typo. * fix telemetry * fix casing issue * fix comma issue. * disable telemetry for rs ; fix stateful set name * worksround for namespace fix * telegraf integration - v1 * telemetry changes for telegraf * telemetry & other changes * remove custom metric regions as we dont need anymore * remove un-needed files * fixes * exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant) * Vishwa/resourcecentric (#208) (#209) * resourceid fix (for AKS only) * fix name * near final metric shape * change from customlog to fixed type (InsightsMetrics) * fix PR feedback * fix pr feedback * Fix telemetry error for telegraf err count metric (#215) * Fix Unscheduled Pod bug, remove excess telemetry (#218) * Fix Unscheduled Pod bug, remove excess telemetry * Send Success Telemetry only once after startup for a node in a cluster for MDM Post * Sending telemetry for successful push to MDM every hour * Merge from Vishwa/promstandardmetrics into ci_feature (#220) * enable prometheus metrics collection in replica-set * fixing typos * fix config file path for replicaset * fix configuration * config changes * merge config/settings to ci_feature (#221) * updating fluentbit to use LOG_TAIL_PATH * changes * log exclusion pattern * changes * removing comments * adding enviornment varibale collection/disable * disable env var for 
cluster variable change * changes * toml parser changes * adding directory tomlrb * changes for container inventory * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Telemetry for config overrides * add schema version telemetry * reduce the number of api calls for namespace filtering add more telemetry for config processing move liveness probe & parser to this repo * optimize for default kube-system namespace log collection exclusion * Fix Scenario when Controller name is empty (#222) * fix ; * ContainerLog collection optimizations (#223) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224) * * derive k8s namespace from file (rather than making a api call) * optimize perf by not tailing excluded namespaces in stdout & stderr * Tuning fluentbit settings based on Cortana teams findings * making db sync off * buffer chunk and max as 1m so that we dont flush > 1m payloads * increasing rotatte wait from 5 secs to 30 secs * decreasing refresh interval from 60 secs to 30 secs * adding retry limit as 10 so that items get dropped in 50 secs rather than infinetely trying * changing flush to 5 secs from 30 secs * fix a minor comment * * change flush from 5 to 10 secs based on perf findings * fix fluent bit tuning for perf run (#226) * fix fluent bit tuning for perf run * stop 
collecting our own partition * fix merge issue * add release notes for june release in ci_feature branch * fix title * update * fix title * Trim spaces in AKS_REGION (#233) This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding * Add Logs Size To Telemetry (#234) * Add Logs to telemetry * Using len instead of unsafe.Sizeof * Merge Vishwa/promcustommetrics to ci_feature (#237) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * Fix Region space error (#239) * Trim spaces in AKS_REGION This is not an issue for normal AKS Monitoring Addon Onboarding. 
ONLY an issue for backdoor onboarding * Fix out_mdm parsing error * Removing buffer chunk size and buffer max size from fluentbit conf (#240) * hard code config for UST CCP team * fix config * fix config after discussion * fix error log to get errros * fix config * update config * Add telemetry * Rashmi/promcustomconfig (#231) * changes * formatting changes * changes * changes * changes * changes * changes * changes * changes * changes * adding telemetry * changes * changes * changes * changes * changes * changes * changes * cahnges * changes * Rashmi/promcustomconfig (#236) * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * fix exceptions * changes to remove some exceptions * exception fixes * changes * changes for poduid nil check * removing buffer chunk size and buffer max size from fluentbit conf * changes (#243) * Collect container last state (#235) * updating the OMS agent to also collect container last state * changed a comment * git surrounded ContainerLastStatus code in a begin/rescue block * added a lot of error checking and logging * Rashmi/fix prom telemetry (#247) * fix prom telemetry * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246) Merge Health to ci_feature * Fix Deserialization Bug (#249) * Fix the bug where capacity is not updated and cached value was being used (#251) * Fix the Capacity computation * fix node cpu and memory limits calculation * changes (#250) * Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253) Added new regions, added handler for MDM plugin start * Add Missing Handlers (#254) * Added Missing Handlers * Return MultiEventStream.new instead of empty array (#256) * Added explicit require_relative to avoid 
loading errors (#258) * Adding explicit require_relative * Gangams/enable ai telemetry in mc (#252) * enable ai telemetry to configure different ikey and endpoint per cloud * Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261) * Expose replica set service as an env variable * Fixing null check out_mdm bug, and tomlparser bug * Updating the env variable name to be more specific to health model * Changes for creating custom plugins with namespace settings for prometheus scraping (#262) * changes * changes * changes * changes * changes * changes * chnages * changes * telemetry changes * changes * Cherry-pick hotfix 09092019 to ci_feature (#265) * Gangams/add telemetry hybrid (#264) * add telemetry to detect the cloud, distro and kernel version * add null check since providerId optional * detect azurestack cloud * rename to KubernetesProviderID since ProviderID name already used in LA * capture workspaceCloud to the telemetry * trim the domain read from file * KubeMonAgentEvents changes to collect configuration events (#267) * changes * changes * changes * changes * changes * changes * env changes * changes * changes * changes * reverting * changes * cahnges * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * chnages * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * 
changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * changes * Fix the Dupe Perf Data Issue from the DaemonSet (#266) * Dupe Perf Record Fix * PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268) * init containers fix and other bug fixes (#269) * init container - KPI and kubeperf changes * changes * changes * changes * changes for empty array fix * changes * changes * pod inventory exception fix * nil check changes * changes * fixing typo * changes * changes * PR - feedback * remove comment * tag pass changes * changes * tagdrop changes * changes * changes * Send agg monitor signal on details change (#270) send when an agg monitor details change, but state did not change * bug fixes for error (#274) * Fix to use declaration and assignment instead of assignment (#275) * bug fixes for error * adding declaration to assignment * 1. Added telemetry (#277) 2. Configuration property changes 3. Bug fixes for a. unscheduled pods returning green 3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent * Bug fix to remove unused variable (#281) * bug fixes for error * adding declaration to assignment * removing unused variable * Fix the WARN<->WARNING typo (#282) * Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284) * Bug fixes 1. 
not writeable, telemetry error * Change to state_WS_dir * Fix Require relative revert (#287) * Bug Fixes for exceptions in telemetry, remove limit set check (#289) * Bug Fixes 10222019 * Initialize container_cpu_memory_records in fhmb * Added telemetry to investigate health exceptions * Set frozen_string_literal to true * Send event once per container when lookup is empty, or limit is an array * Unit Tests, Use RS and POD to determine workload * Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref * Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (#292) * Fix for Nodes Aspect not showing up in draft cluster (#294) * Fix the issue where the health tree is inconsistent if a deployment is deleted (#295) * Rashmi/1 16 test (#297) * health deployment update * apps v1 changes for deployment * changes * changes to use relicasets and api groups * Fix duplicate records in container memory/cpu samples (#298) * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * fix exceptions (#306) * Merge Branch morgan into ci_feature (#308) * Fixes : 1) Disable health (for time being) - in DS & RS 2) Disable MDM (for time being) - in DS & RS 3) Merge kubeperf into kubenode & kubepod 4) Made scheduling predictable for kubenode & kubepod 5) Enable containerlog enrichment fields (timeofcommand, containername & containerimage) as a configurable setting (default = true/ON) - Also add telemetry for it 6) Filter OUT type!=Normal events for k8s events 7) AppInsights telemetry async 8) Fix double calling bug in in_win_cadvisor_perf 9) Add connect timeout (20secs) & read timeout (40 secs) for all 
cadvisor api calls & also for all kubernetes api server calls 10) Fix batchTime for kubepods to be one before making api server call (rather than after making the call, which will make it fluctuate based on api server latency for the call) * fix setting issue for the new enrichcontainerlog setting * fix compilation issue * fix another compilation issue * fix emit issues * fix a nil issue * fix mising tag * * Fix all input plugins for scheduling issue * Merge kubeservices with kubepodinventory (reduce RS to API server by one more) * Remove Kubelogs (not used) * Fix liveness probe * Disable enrichment by default for container logs * Move to yajl json parser across the board for docker provier code * Remove unused files * fix removed files * fix timeofcommand and remove a duplicate entry for a health file. * Rashmi/http leak fixes (#301) * changes for http connection close * close socket in ensure * adding nil check * Rashmi/http leak fixes (#303) * changes for http connection close * close socket in ensure * adding nil check * adding missing end * use yajl for events & nodes parsing. 
* Rashmi/http leak fixes (#304) * changes for http connection close * close socket in ensure * adding nil check * Update MDM region list to include francecentral, japaneast and australiaeast * Update MDM region list to include francecentral, japaneast and australiaeast * adding missing end * Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300) * changes for chunking * telemetry changes * some fixes * bug fix * changing to have morgan changes only * add new line * use polltime for metrics and disable out_forward for health * enable mdm & health * few optimizations * do not remove time of command make kube.conf same as scale tested config * remove comments from container.conf * remove flush comment for ai telemetry * remove commented code lines * fix config * remove timeofcommand when enrichment==false * fix config * enable mdm filter * Rashmi/api chunk (#307) * changes * changes * refactor changes * changes * changes * changes * changes * node changes * changes * changes * changes * changes * adding open and read timeouts for api client * removing comments * updating chunk size * Update Readme * add back timeofcommand (#310) * Adding new cpu and memory limits to readme * CAdvisor to use 10255/10250 based on env variable (#321) * CAdvisor secure port changes (#320) * cadvsior secure port changes * update to use secure/insecure port for cadvisor * telemetry changes * fix bug * bug fix * changes * Adding cadvisor uri log * switching defaults * update readme * changes * changing font for code change and customer impact * For ARO, stop collecting inventory of master and infra (#323) * filter out infra and master nodes inventory for aro * filterout pods info scheduled master and infra nodes * fix redundant KubernetesApiClient name * filter out events sourced from master and infra nodes * fix in kubeapi * add the comments * fix pr feedback * minor updates * fix pr feedback * encode special characters in 
query * some refactoring * MDM plugin support for large scale clusters (#324) * Batch Commit * WIP: Committing move logic from filter to input * WIP : MDM plugins for scale clusters * Bug fixes 1. cpu percentage 2. bytesize on array. Remove log line * Fixing metric value in cadvisor2mdm plugin * WIP to laptop * Working version with cadvisor changes * Fix Health cpu usage * Added uri for cadvisor failure * Add Null check for kube api responses in in_kube_health (#325) * Fix casing bug (#326) * Missed kube.conf update (#327) * changes to use msi if service principal does not exist (#328) changes to use msi if service principal does not exist (#328) * Adding caseinsensitive compare (#330) Adding case insensitive compare * gpu monitoring (#329) * gpu monitoring * Emit info log for tests for the new insightsmetrics data stream * Update release notes * MDM batch Bug (#336) * kube evnts bug fix (#335) * Update readme.md * Rate Limiting changes (#338) * Add header for throttling Add telemetry for throttling Flush 15secs for telegraf metrics * * Add requestid & log it for errors * Fix bug in semver * Gangams/add support cri runtime docker env (#337) * trim log tag for cri compatible log lines * fix variable declaration * debug log messages * trim spaces * remove debug logs * fix build error * add pods api in cadvisor * wip * wip * wip * wip * fix bug * fix bug * fix bug with end * fix bug in the reference * fix bug * fix telemetry * fix image and imageid bug * fix containerid bug * fix envvars * fix syntax error * fix syntax error * fix env and command * handle labels and annotation fieldref as envs * wip * implement obtain envvars * refactor the code * fix log warn * add more logging * fix npe * revert kubelet api to get the envvars * fix minor issue * fix log message * comment valueFrom for now since this causing some issue * fix formatting issue * fix bug * add resourcefieldref support in env * include init containers as well * fix bug * fix typo * use chomp instead of 
delete_suffix * add secretkeyref support * add secretkeyref support * fix bug * handle dockerversion for non-docker runtimes * fix comment * handle kubelet metrics to handle runtime and k8s versions * remove _total metrics until we support 1.18 * fetch envvars via proc environ * fix undefined vars * fix undefined vars * add init containers * split on null character * consider crio containers main proceses for envvars * unschedule pods or containers with pull issues * fix issue * fix feedback * update to azmon parsers * fix file extension * add some debug messages * fix timestamp issue * fix pr feedback * refactor linux containerenvvars code * refactor windows container inventory code * fix rb file load error from in_kube_node inventory * add new rb file base_container.data * turnoff docker container inventory * cleanup debug logs * fix npe in log message * avoid environ for terminated containers * fix pr feedback * remove empty file * clean up debug logs * Gangams/fix cri exceptions (#339) * fix nil exception * add exception logging message * fix npe * fix log exception (#340) * Updating release notes, mdm bug fix (#333) (#343) Updating release notes, mdm bug fix for release Co-authored-by: rashmichandrashekar <rashmy@microsoft.com> Co-authored-by: Vishwanath Narasimhan <visnara@microsoft.com> Co-authored-by: rashmy <rashmy@RASHMY-PC2> Co-authored-by: r-dilip <dilip.rangarajan@gmail.com> Co-authored-by: rashmichandrashekar <rashmy@microsoft.com> Co-authored-by: Kaveesh Dubey <kadubey@microsoft.com> Co-authored-by: David Michelman <daweim0@gmail.com> Co-authored-by: Dilip Raghunathan <dilipr@microsoft.com>
…d) (#311)
Updating release history
fixing the plugin logs for emit stream
updating log message
Remove Log Processing from fluentd configuration
Remove plugin references from base_container.data
Dilipr/fluent bit log processing (#126)
Build out_oms.so and include in docker-cimprov package
Adding fluent-bit-config file to base container
PR Feedback
Adding out_oms.conf to base_container.data
PR Feedback
Making the critical section as small as possible
PR Feedback
Fixing the newline bug for Computer, and changing containerId to Id
Dilipr/glide updates (#127)
Updating glide.* files to include lumberjack
containerID="" for pull issues
Using KubeAPI for getting image,name. Adding more logs (#129)
Using KubeAPI for getting image,name. Adding more logs
Moving log file and state file to within the omsagent container
Changing log and state paths
Dilipr/mark comments (#130)
Marks Comments + Error Handling
Drop records from files that are not in k8s format
Remove unnecessary log line
Adding Log to the file that doesn't conform to the expected format
Rashmi/segfault latest (#132)
adding null checks in all providers
fixing type
fixing type
adding more null checks
update cjson
Adding a missed null check (#135)
reusing some variables (#136)
Rashmi/cjson delete null check (#138)
adding null check for cjson-delete
null chk
removing null check
updating log level to debug for some provider workflows (#139)
Fixing CPU Utilization and removing Fluent-bit filters (#140)
Removing fluent-bit filters, CPU optimizations
Minor tweaks 1. Remove some logging 2. Added more Error Handling 3. Continue when there is an error with k8s api (#141)
Removing some logs, added more error checking, continue on kube-api error
Return FLB OK for JSON marshal error, instead of RETRY
Remove ContainerPerf, ContainerServiceLog,ContainerProcess (OMI workflows) for Daemonset
Container Log Telemetry
Fixing an issue with Send Init Event if Telemetry is not initialized properly, tab to whitespace in conf file
PR feedback
PR feedback
Sending an event every 5 mins(Heartbeat) (Send Event Telemetry, Remove Config for Buffer_Chunk_Size, Buffer_Max_Size #146)
PR feedback to cleanup removed workflows
updating agent version for telemetry
updating agent version
Telemetry Updates (#149)
Telemetry Fixes 1. Added Log Generation Rate 2. Fixed parsing bugs 3. Added code to send Exceptions/errors
PR Feedback
Changes to send omsagent/omsagent-rs kubectl logs to App Insights (#159)
Changes to send omsagent/omsagent-rs kubectl logs to App Insights
PR Feedback
Rashmi/fluentd docker inventory (#160)
first stab
changes
changes
docker util changes
working tested util
input plugin and conf
changes
changes
changes
changes
changes
working containerinventory
fixing omi removal from container.conf
removing comments
file write and read
deleted containers working
changes
changes
socket timeout
deleting test files
adding log
fixing comment
appinsights changes
changes
tel changes
changes
changes
changes
changes
lib changes
changes
changes
fixes
PR comments
changes
updating the ownership
changes
changes
changes to container data
removing comment
changes
adding collection time
bug fix
env string truncation
changes for acs-engine test
Fix Telemetry Bug -- Initialize Telemetry Client after Initializing all required properties (Fix Telemetry Bug -- Computer is blank #162)
Fix kube events memory leak due to yaml serialization for > 5k events (#163)
Setting Timeout for HTTP Client in PostDataHelper in outoms go plugin(Setting timeout for HTTP Client #164)
Vishwa/perftelemetry 2 (#165)
add cpu usage telemetry for ds & rs
add cpu & memory usage telemetry for ds & rs
environment variable fix (#166)
environment variable fix
updating agent version
Fixing a bug where we were crashing due to container statuses not present when node was lost (#167)
Updating title
updating right versions for last release
Updating the break condition to look for end of response (#168)
Updating the break condition to look for end of response
changes for docker response
updating AgentVersion for telemetry
Updating readme for latest release changes
Changes - (multiple fixes #173)
use /var/log for state
new metric ContainerLogsAgentSideLatencyMs
new field 'timeOfCommand'
Rashmi/kubenodeinventory (#174)
containernodeinventory changes
changes for containernodeinventory
changes to add node telemetry
pod telemetry changes
updated telemetry changes
changes to get uid of owner references as controller id
Get cpuusage from usageseconds (#175)
Rashmi/kubenodeinventory (#176)
containernodeinventory changes
changes for containernodeinventory
changes to add node telemetry
pod telemetry changes
updated telemetry changes
changes to get uid of owner references as controller id
updating socket to the new mount location
Adding exception telemetry and heartbeat
changes to fix controller type
Fixing typo
fixing method signature
updating plugins to get controller type from env
fixing bugs
Rashmi/kubenodeinventory (#178)
containernodeinventory changes
changes for containernodeinventory
changes to add node telemetry
pod telemetry changes
updated telemetry changes
changes to get uid of owner references as controller id
updating socket to the new mount location
Adding exception telemetry and heartbeat
changes to fix controller type
Fixing typo
fixing method signature
updating plugins to get controller type from env
fixing bugs
changes to fixed type
removing comments
changes for fixed type
Fixing an issue on the cpurate metric, which happens for the first time (when cache is empty) (#179)
Rashmi/kubenodeinventory (#180)
containernodeinventory changes
changes for containernodeinventory
changes to add node telemetry
pod telemetry changes
updated telemetry changes
changes to get uid of owner references as controller id
updating socket to the new mount location
Adding exception telemetry and heartbeat
changes to fix controller type
Fixing typo
fixing method signature
updating plugins to get controller type from env
fixing bugs
changes to fixed type
removing comments
changes for fixed type
adding kubelet version as a dimension
Exclude docker containers from container inventory (#181)
containernodeinventory changes
changes for containernodeinventory
changes to add node telemetry
pod telemetry changes
updated telemetry changes
changes to get uid of owner references as controller id
updating socket to the new mount location
Adding exception telemetry and heartbeat
changes to fix controller type
Fixing typo
fixing method signature
updating plugins to get controller type from env
fixing bugs
changes to fixed type
removing comments
changes for fixed type
adding kubelet version as a dimension
Excluding raw docker containers from container inventory
making labels key case insensitive
make poduid label case insensitive
Exclude pauseamd64 containers from container inventory (#182)
containernodeinventory changes
changes for containernodeinventory
changes to add node telemetry
pod telemetry changes
updated telemetry changes
changes to get uid of owner references as controller id
updating socket to the new mount location
Adding exception telemetry and heartbeat
changes to fix controller type
Fixing typo
fixing method signature
updating plugins to get controller type from env
fixing bugs
changes to fixed type
removing comments
changes for fixed type
adding kubelet version as a dimension
Excluding raw docker containers from container inventory
making labels key case insensitive
make poduid label case insensitive
changes to exclude pause amd 64 containers
Update agent version
Updating readme for the latest release
Fix indentation in kube.conf and update readme (#184)
containernodeinventory changes
changes for containernodeinventory
changes to add node telemetry
pod telemetry changes
updated telemetry changes
changes to get uid of owner references as controller id
updating socket to the new mount location
Adding exception telemetry and heartbeat
changes to fix controller type
Fixing typo
fixing method signature
updating plugins to get controller type from env
fixing bugs
changes to fixed type
removing comments
changes for fixed type
adding kubelet version as a dimension
Excluding raw docker containers from container inventory
making labels key case insensitive
make poduid label case insensitive
changes to exclude pause amd 64 containers
fixing indentation so that kube.conf contents can be used in config map in the yaml
updating readme to fix date and agent version
updating agent tag
Get Pods for current Node Only (#185)
Fix KubeAPI Calls to filter to get pods for current node
Reinstate log line
changes for container node inventory fixed type (#186)
Fix for mooncake (disable telemetry optionally) (#191)
disable telemetry option
fix a typo
CustomMetrics to ci_feature (#193)
Custom Metrics changes to ci_feature
add ContainerNotRunning column to KubePodInventory
merge pr feedback: update name to ContainerStatusReason
Zero Fill for Missing Pod Phases, Change Namespace Dimension to Kubernetes namespace, as it might be confused with metrics namespace in Metrics Explorer (#194)
Zero Fill for Pod Counts by Phase
Change namespace dimension to Kubernetes namespace
No Retries for non 404 4xx errors (Dont Retry for non403 4xx errors #196)
Update agent version for telemetry
Update readme for upcoming (ciprod01202019) release
fix readme formatting
fix formatting for readme
fix formatting for readme
fix readme
fix readme
fix agent version for telemetry
fix date in readme
update readme
Restart logs every 10MB instead of weekly (#198)
Rotate logs every 10MB instead of weekly
Removing some logging, fixed log rotation
update agent version for telemetry
update readme
Update kube.conf to use %STATE_DIR_WS% instead of hardcoded path
Fix AKSEngine Crash (#200)
hotfix
close resp.Body
remove chatty logs
membuf=5m and ignore files not updated since 5 mins
fix readme for new version
Fix the pod count in mdm agent plugin (Fix the Pod Inventory bug #203)
Update readme
string freeze for out_mdm plugin
Vishwa/resourcecentric (#208)
resourceid fix (for AKS only)
fix name
Rashmi/win nodepool - PR (#206)
changes for win nodes enumeration
changes
changes
changes
node cpu metric rate changes
container cpu rate
changes
changes
changes
changes
changes
changes to include in_win_cadvisor_perf.rb file
send containerinventoryheartbeatevent
changes
changes for mdm metrics
changes
changes
changes
container states
changes
changes
changes for env variables
changes
changes
changes
changes
delete comments
changes
mutex changes
changes
changes
changes
telemetry fix for docker version
removing hardcoded values for mdm
update docker version
telemetry for windows cadvisor timeouts
exception key update to computer
PR comments
adding os to container inventory for windows nodes (adding os to container inventory telemetry for windows nodes #210)
Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors (#211)
Fix omsagent crash Error when kube-api returns non-200, send events for HTTP Errors
Fixing the bug, deferring telemetry changes for later
updating to lowercase compare for units (#212)
Merge from vishwa/telegraftcp to ci_feature for telegraf changes (#214)
merge from Vishwa/telegraf to Vishwa/telegraftcp for telegraf changes (#207)
add configuration for telegraf
fix for perms
fix telegraf config.
fix file location & config
update to config
fix namespace
trying different namespace and also debug=true
add placeholder for nodename
change namespace
updated config
fix uri
fix azMon settings
remove aad settings
add custom metrics regions
fix config
add support for replica-set config
fix oomkilled
Add telegraf 403 metric telemetry & non 403 trace telemetry
fix type
fix package
fix package import
fix filename
delete unused file
conf file for rs; fix 403counttotal metric for telegraf, remove host and use nodeName consistently, rename metrics
fix statefulsets
fix typo.
fix another typo.
fix telemetry
fix casing issue
fix comma issue.
disable telemetry for rs ; fix stateful set name
workaround for namespace fix
telegraf integration - v1
telemetry changes for telegraf
telemetry & other changes
remove custom metric regions as we dont need anymore
remove un-needed files
fixes
exclude certain volumes and fix telemetry to not have computer & nodename as dimensions (redundant)
Vishwa/resourcecentric (#208) (merging from ci_feature into vishwa/telegraftcp #209)
resourceid fix (for AKS only)
fix name
near final metric shape
change from customlog to fixed type (InsightsMetrics)
fix PR feedback
fix pr feedback
Fix telemetry error for telegraf err count metric (#215)
Fix Unscheduled Pod bug, remove excess telemetry (#218)
Fix Unscheduled Pod bug, remove excess telemetry
Send Success Telemetry only once after startup for a node in a cluster for MDM Post
Sending telemetry for successful push to MDM every hour
Merge from Vishwa/promstandardmetrics into ci_feature (#220)
enable prometheus metrics collection in replica-set
fixing typos
fix config file path for replicaset
fix configuration
config changes
merge config/settings to ci_feature (#221)
updating fluentbit to use LOG_TAIL_PATH
changes
log exclusion pattern
changes
removing comments
adding environment variable collection/disable
disable env var for cluster variable change
changes
toml parser changes
adding directory tomlrb
changes for container inventory
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
Telemetry for config overrides
add schema version telemetry
reduce the number of api calls for namespace filtering
add more telemetry for config processing
move liveness probe & parser to this repo
optimize for default kube-system namespace log collection exclusion
Fix Scenario when Controller name is empty (Fix scenario where controller name is empty #222)
fix ;
ContainerLog collection optimizations (#223)
optimize perf by not tailing excluded namespaces in stdout & stderr
Tuning fluentbit settings based on Cortana teams findings
making db sync off
buffer chunk and max as 1m so that we dont flush > 1m payloads
increasing rotate wait from 5 secs to 30 secs
decreasing refresh interval from 60 secs to 30 secs
adding retry limit as 10 so that items get dropped in 50 secs rather than infinitely trying
changing flush to 5 secs from 30 secs
merge final changes for release from Vishwa/june2019agentrel to ci_feature (#224)
optimize perf by not tailing excluded namespaces in stdout & stderr
Tuning fluentbit settings based on Cortana teams findings
making db sync off
buffer chunk and max as 1m so that we dont flush > 1m payloads
increasing rotate wait from 5 secs to 30 secs
decreasing refresh interval from 60 secs to 30 secs
adding retry limit as 10 so that items get dropped in 50 secs rather than infinitely trying
changing flush to 5 secs from 30 secs
fix a minor comment
fix fluent bit tuning for perf run (#226)
fix fluent bit tuning for perf run
stop collecting our own partition
fix merge issue
add release notes for june release in ci_feature branch
fix title
update
fix title
Trim spaces in AKS_REGION (Fix Region Spacing bug for backdoor agent onboarding scenarios #233)
This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding
Add Logs Size To Telemetry (#234)
Add Logs to telemetry
Using len instead of unsafe.Sizeof
Merge Vishwa/promcustommetrics to ci_feature (#237)
hard code config for UST CCP team
fix config
fix config after discussion
fix error log to get errors
fix config
update config
Add telemetry
Rashmi/promcustomconfig (#231)
changes
formatting changes
changes
changes
changes
changes
changes
changes
changes
changes
adding telemetry
changes
changes
changes
changes
changes
changes
changes
changes
changes
Rashmi/promcustomconfig (#236)
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
fix exceptions
changes to remove some exceptions
exception fixes
changes
changes for poduid nil check
Fix Region space error (#239)
Trim spaces in AKS_REGION
This is not an issue for normal AKS Monitoring Addon Onboarding. ONLY an issue for backdoor onboarding
Fix out_mdm parsing error
Removing buffer chunk size and buffer max size from fluentbit conf (#240)
hard code config for UST CCP team
fix config
fix config after discussion
fix error log to get errors
fix config
update config
Add telemetry
Rashmi/promcustomconfig (#231)
changes
formatting changes
changes
changes
changes
changes
changes
changes
changes
changes
adding telemetry
changes
changes
changes
changes
changes
changes
changes
changes
changes
Rashmi/promcustomconfig (#236)
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
fix exceptions
changes to remove some exceptions
exception fixes
changes
changes for poduid nil check
removing buffer chunk size and buffer max size from fluentbit conf
changes (Fixed poduid bug in kube perf #243)
Collect container last state (#235)
updating the OMS agent to also collect container last state
changed a comment
surrounded ContainerLastStatus code in a begin/rescue block
added a lot of error checking and logging
Rashmi/fix prom telemetry (#247)
fix prom telemetry
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
Merge Health Model work into ci_feature behind a feature flag Pending perf testing (#246)
Merge Health to ci_feature
Fix Deserialization Bug (Fix State Persistence Bug #249)
Fix the bug where capacity is not updated and cached value was being used (#251)
Fix the Capacity computation
fix node cpu and memory limits calculation
changes (Fixing the plugin crash when docker sock is unavailable #250)
Added new Custom Metrics Regions, fixed MDM plugin crash bug (#253)
Added new regions, added handler for MDM plugin start
Add Missing Handlers (#254)
Added Missing Handlers
Return MultiEventStream.new instead of empty array (#256)
Added explicit require_relative to avoid loading errors (#258)
Adding explicit require_relative
Gangams/enable ai telemetry in mc (#252)
enable ai telemetry to configure different ikey and endpoint per cloud
Fixing null check out_mdm bug, tomlparser bug, exposing Replica Set service name as an ENV variable (#261)
Expose replica set service as an env variable
Fixing null check out_mdm bug, and tomlparser bug
Updating the env variable name to be more specific to health model
Changes for creating custom plugins with namespace settings for prometheus scraping (#262)
changes
changes
changes
changes
changes
changes
changes
changes
telemetry changes
changes
Cherry-pick hotfix 09092019 to ci_feature (#265)
Gangams/add telemetry hybrid (#264)
add telemetry to detect the cloud, distro and kernel version
add null check since providerId optional
detect azurestack cloud
rename to KubernetesProviderID since ProviderID name already used in LA
capture workspaceCloud to the telemetry
trim the domain read from file
KubeMonAgentEvents changes to collect configuration events (#267)
changes
changes
changes
changes
changes
changes
env changes
changes
changes
changes
reverting
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
changes
Fix the Dupe Perf Data Issue from the DaemonSet (#266)
Dupe Perf Record Fix
PR for 1. Container Memory CPU monitor 2. Configuration for Node Conditions 3. Fixed Type Changes 4. Use Env variable, and health_forward (that handles network errors at init) 5. Unit Tests (#268)
init containers fix and other bug fixes (#269)
init container - KPI and kubeperf changes
changes
changes
changes
changes for empty array fix
changes
changes
pod inventory exception fix
nil check changes
changes
fixing typo
changes
changes
PR - feedback
remove comment
tag pass changes
changes
tagdrop changes
changes
changes
Send agg monitor signal on details change (#270)
send when an agg monitor details change, but state did not change
bug fixes for error (#274)
Fix to use declaration and assignment instead of assignment (#275)
bug fixes for error
adding declaration to assignment
3b. Sometimes, the details hash of agg monitors are different because the order of elements inside the array is different, causing the records to be sent
Bug fix to remove unused variable (#281)
bug fixes for error
adding declaration to assignment
removing unused variable
Fix the WARN<->WARNING typo (FIX WARN WARNINGTypo #282)
Bug Fixes 1. telemetry send throwing exception if records not initialized 2. permissions error in on-prem clusters (#284)
Bug fixes 1. not writeable, telemetry error
Change to state_WS_dir
Fix Require relative revert (Fix the revert that removed explicit require #287)
Bug Fixes for exceptions in telemetry, remove limit set check (#289)
Bug Fixes 10222019
Initialize container_cpu_memory_records in fhmb
Added telemetry to investigate health exceptions
Set frozen_string_literal to true
Send event once per container when lookup is empty, or limit is an array
Unit Tests, Use RS and POD to determine workload
Fixed Node Condition Bug, added exception handling to return get_rs_owner_ref
Fix the bug where if a warning condition appears before fail condition, the node condition is reported as warning instead of fail. Also fix the node conditions state to consider unknown as a failure state (Fix Node Condition bug #292)
Fix for Nodes Aspect not showing up in draft cluster (#294)
Fix the issue where the health tree is inconsistent if a deployment is deleted (Fix the health tree inconsistency when a deployment is deleted #295)
Rashmi/1 16 test (#297)
health deployment update
apps v1 changes for deployment
changes
changes to use replicasets and api groups
Fix duplicate records in container memory/cpu samples (#298)
Update MDM region list to include francecentral, japaneast and australiaeast
Update MDM region list to include francecentral, japaneast and australiaeast
Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300)
fix exceptions (#306)
Merge Branch morgan into ci_feature (#308)
Fixes :
fix setting issue for the new enrichcontainerlog setting
fix compilation issue
fix another compilation issue
fix emit issues
fix a nil issue
fix missing tag
Merge kubeservices with kubepodinventory (reduce RS to API server by one more)
Remove Kubelogs (not used)
Fix liveness probe
Disable enrichment by default for container logs
Move to yajl json parser across the board for docker provider code
Remove unused files
fix removed files
fix timeofcommand and remove a duplicate entry for a health file.
Rashmi/http leak fixes (#301)
changes for http connection close
close socket in ensure
adding nil check
Rashmi/http leak fixes (#303)
changes for http connection close
close socket in ensure
adding nil check
adding missing end
use yajl for events & nodes parsing.
Rashmi/http leak fixes (#304)
changes for http connection close
close socket in ensure
adding nil check
Update MDM region list to include francecentral, japaneast and australiaeast
Update MDM region list to include francecentral, japaneast and australiaeast
adding missing end
Send telemetry when there is error in calculation of state in percentage aggregation, and send state as unknown (#300)
changes for chunking
telemetry changes
some fixes
bug fix
changing to have morgan changes only
add new line
use polltime for metrics and disable out_forward for health
enable mdm & health
few optimizations
do not remove time of command
make kube.conf same as scale tested config
remove comments from container.conf
remove flush comment for ai telemetry
remove commented code lines
fix config
remove timeofcommand when enrichment==false
fix config
enable mdm filter
Rashmi/api chunk (#307)
changes
changes
refactor changes
changes
changes
changes
changes
node changes
changes
changes
changes
changes
adding open and read timeouts for api client
removing comments
updating chunk size
Update Readme
add back timeofcommand (Merge into ci_feature (fix: add back timeofcommand) #310)